Abstract: We present our perspective and goals on
high-performance computing for nanoscience in accordance with the
global trend toward "peta-scale computing." After reviewing 
our results obtained through the grid-enabled version of the 
fragment molecular orbital method (FMO) on the grid testbed 
by the Japanese Grid Project, National Research Grid Initiative 
(NAREGI), we show that FMO is one of the best candidates for 
peta-scale applications by predicting its effective performance 
in peta-scale computers. Finally, we introduce our new project 
constructing a peta-scale application in an open-architecture 
implementation of FMO in order to realize both goals of high- 
performance in peta-scale computers and extendibility to multi- 
physics simulations. 

I. Introduction 

Thanks to recent developments in grid computing environments,
we can now use large pools of computer resources comprising
more than a thousand CPUs. However, those large distributed
resources are accessible only when we pass through some 
gateway to the grid system after a tedious procedure for 
grid-enabling of application programs. Thus, for scientists in 
nanoscience to use the grid system as one of the daily tools, it 
is important to make their applications grid-aware in advance. 

On the other hand, the development of high-performance
computing (HPC) environments shows no sign of slowing
down, and, within several years, we might reach the scale of
peta (~10^15) in computer resources for scientific computing.
A "peta-scale computer" is expected to deliver more than
one peta flops in floating-point calculations, to provide more
than one peta byte of memory in total, or to have more than
one peta byte of external storage. Thus, the global trend in
HPC is now to study how to realize peta-scale computing [1].

The purpose of this paper is twofold: one is to present 
our computational results in nanoscience supported by the 
Japanese Grid project, National Research Grid Initiative 
(NAREGI) [2], and the other is to show a perspective about 
the HPC applications for peta-scale computing. This paper is
organized as follows. The computational results obtained with the
fragment molecular orbital method (FMO) on grid computing
environments are shown in section II. The performance
prediction of the FMO calculation on a peta-scale computer is
shown in section III, and the actual implementation of FMO
intended to realize peta-scale computing is introduced in
section IV. Finally, in section V, we discuss what is necessary
for actual peta-scale computing in scientific applications.

Fig. 1. In FMO, the target molecule is divided into fragments for which
ab initio calculation is performed. Usually, carbon-carbon single bonds are
chosen as a boundary between fragments.

II. Grid-enabled FMO 

In this section, we review the results on a grid-enabled
version of FMO, and evaluate its effectiveness on the grid testbed
of NAREGI [3]. Our main contribution to the project is to
develop large-scale grid-enabled applications in nanoscience.
One of these applications is Grid-FMO, which is based
on the widely used ab initio molecular orbital (MO) package
GAMESS [6] and is constructed by dividing the
package into several independent modules so that they can be
executed on a grid environment. The algorithm of FMO and
the implementation of Grid-FMO are briefly reviewed in the
following subsections.

A. Algorithm of FMO 

Fig. 2. Flow of calculation of the grid-enabled FMO

The FMO method developed by Dr. Kitaura and co-workers
[4] can execute all-electron calculations on large molecules with
more than ten thousand atoms. A brief overview of the flow
of the fragment MO method up to the dimer correction is as
follows: (1) the molecule to be calculated is divided into
fragments, (2) an ab initio MO calculation [5] is executed for each
fragment (monomer) under the electrostatic potential made by
all the other fragments, and this calculation is repeated until
the potential is self-consistently converged, (3) an ab initio MO
calculation is executed for each pair of fragments (dimer), and
(4) the total energy is obtained by summing up all the results.
This algorithm is implemented in several well-known MO
packages such as GAMESS [6], ABINIT-MP [7], etc.
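
Step (4) can be illustrated with a minimal sketch. It assumes the commonly quoted two-body form of the FMO energy, E = sum_I E_I + sum_{I<J} (E_IJ - E_I - E_J), which the text above summarizes only as "summing up all the results." The function name and the toy energies are hypothetical, and the ES-dimer approximation is not treated.

```python
from itertools import combinations


def fmo2_total_energy(monomer_energy, dimer_energy):
    """Two-body FMO sum: E = sum_I E_I + sum_{I<J} (E_IJ - E_I - E_J).

    monomer_energy maps a fragment index to E_I; dimer_energy maps an
    ordered pair (I, J) with I < J to E_IJ. All values are illustrative.
    """
    total = sum(monomer_energy.values())
    for i, j in combinations(sorted(monomer_energy), 2):
        # Pair correction: dimer energy minus the two monomer energies.
        total += dimer_energy[(i, j)] - monomer_energy[i] - monomer_energy[j]
    return total


# Toy numbers for three fragments (not taken from any real calculation).
monomers = {0: -1.0, 1: -2.0, 2: -3.0}
dimers = {(0, 1): -3.1, (0, 2): -4.0, (1, 2): -5.2}
total_energy = fmo2_total_energy(monomers, dimers)
print(total_energy)
```

Each monomer and dimer term enters the sum independently, which is why the expensive calculations behind them can be distributed freely.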

Although the most time-consuming part of this calculation 
is ab initio MO calculations for each fragment and each pair of 
fragments, these calculations can be executed independently. 
Thus, FMO can be executed on parallel computers with
high efficiency. The FMO routines included in those
packages are already parallelized for cluster machines and
are tuned so that they perform efficiently.
However, if the program is used in distributed computing
environments, we should consider robustness and controllability
in addition to efficiency. Thus, grid-enabling is necessary.
Among the many ways of grid-enabling, we choose a strategy that
reconfigures the program into a loosely-coupled form in order
to satisfy these requirements.

B. Loosely-coupled FMO 

We have developed a grid-enabled version of FMO, called
the Loosely-coupled (LC) FMO program [8], as part of the
NAREGI project. First, we show the procedure to change
the GAMESS-FMO programs into a "loosely-coupled form,"
where the original FMO in the GAMESS package is divided
into several "single task" modules which are connected to each
other through input/output file transfers. Those modules
consist of run.ini for the initialization, run.mon for the
fragment calculation (monomer), run.dim for the fragment
pair calculation (dimer), and run.tot for the total energy
calculation, and are invoked from a script program which also
manages the file transfer over distributed computers. The total
flow of the LC-FMO is shown in fig. 2.

Fig. 3. Flow of LC-FMO represented in NAREGI Workflow Tool

Another benefit of the loosely-coupled form is the
extendibility of its functionality. In fact, after the first release
of the LC-FMO, we have developed two other modules
which realize a linkage to the initial density database and
a visualization feature of the total molecular orbitals. Since
the top-level program that invokes the modules is lightweight and
can be written in a script language, further modification of the
total flow is easy.
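
The control flow of such a top-level script can be sketched as follows. This is only an illustration under stated assumptions: the four functions are hypothetical stand-ins named after the modules run.ini, run.mon, run.dim, and run.tot, they exchange Python objects instead of files, and the self-consistent repetition of the monomer step is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


# Hypothetical stand-ins for the four single-task modules. In LC-FMO each
# stage is a separate program connected by file transfers; plain functions
# are used here so that the flow can be run anywhere.
def run_ini(n_fragments):
    """Initialization: return the list of fragment indices."""
    return list(range(n_fragments))


def run_mon(fragment):
    """Monomer task: placeholder result for one fragment."""
    return ("mon", fragment)


def run_dim(pair):
    """Dimer task: placeholder result for one fragment pair."""
    return ("dim", pair)


def run_tot(monomer_results, dimer_results):
    """Total-energy stage: here it only counts the collected results."""
    return {
        "n_monomer_results": len(monomer_results),
        "n_dimer_results": len(dimer_results),
    }


def drive(n_fragments, max_workers=4):
    """Lightweight driver: independent tasks are farmed out to a worker pool."""
    fragments = run_ini(n_fragments)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        monomer_results = list(pool.map(run_mon, fragments))
        dimer_results = list(pool.map(run_dim, combinations(fragments, 2)))
    return run_tot(monomer_results, dimer_results)


summary = drive(5)
print(summary)
```

Because the driver only sequences the stages, adding a module (for example a visualization step) amounts to adding one more call after `run_tot`.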

C. Execution on NAREGI Grid 

In order to execute LC-FMO on the NAREGI grid, we put
the flow of the program into NAREGI Workflow Tool [3]. The
graphical representation of the LC-FMO program is shown in
fig. 3, where modules and input/output files are represented
as icons connected to each other. This procedure is simple and
straightforward because we have already reconstructed the
program into a form suitable for distributed computing. Thus,
the important point is whether the basic structure of the program is
grid-aware or not.

Once a program is grid-enabled, we can execute it
using large-scale computing resources of more than one
thousand CPUs. From the programmer's point of view, the
greatest advantage is that we are relieved of arranging
the resources to execute the program with high efficiency. In
the NAREGI testbed system [3], NAREGI Super Scheduler
manages the resource arrangement with the help of NAREGI
Information Service.

D. Efficiency of LC-FMO 

Robustness and efficiency generally conflict with each other.
Since our reconstruction of FMO to increase the
controllability might hurt the efficient execution of the
program, it is necessary to evaluate the efficiency of LC-FMO by
comparing it with the original FMO program in GAMESS.

In table I, we show elapsed times for chicken egg white
cystatin (1CEW, 1701 atoms, 106 fragments, shown in fig. 4)
obtained by the LC-FMO and the original GAMESS-FMO codes. These
were measured on the NAREGI testbed system, which consists of
16 CPUs of Xeon 3 GHz. Generally speaking, an increase of
the total elapsed time is inevitable when we reconstruct some
application into a loosely-coupled form, which is considered
as a cost of grid-enabling. As shown in table I, however, the
increase of the time is relatively small. Thus, we conclude
that the LC-FMO is executed effectively on grids.

TABLE I

The elapsed time for chicken egg white cystatin.

Fig. 4. The molecular structure of chicken egg white cystatin (1CEW, 1701
atoms, 106 fragments).

III. Performance Prediction 

In this section, we predict what performance would be
expected if we executed our FMO program on a "peta-scale
computer." Since the actual hardware is not available at
present, the prediction is based on a hypothetical specification
of the computer.

A. Hypothetical Specification of the Peta-scale Computer 

It is said that computer systems with peta-scale
specifications will be designed within a few years [9]. Here, we
concentrate on the CPU resources of the peta-scale computer,
and estimate how many CPU resources are necessary to attain
peta-scale performance.

The current peak performance of a single-core CPU is
of the order of 10 giga flops. If a parallel computer with 10
peta-flops peak performance is built from such CPUs to
achieve 1 peta-flops effective performance, 1,000,000 CPU
cores are necessary in total. Since a cluster system constructed
from 1,000,000 machines connected to each other is impractical,
it is necessary to arrange those resources into multiple layers.
We are not concerned here with how to realize computational nodes
with multiple CPU cores. If we can configure the lowest layer
from nodes with the performance of 100 ~ 1,000 CPUs each, the
total computer will be a parallel system composed of 1,000 ~
10,000 nodes.
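
The counting argument above can be retraced in a few lines. This sketch uses only the figures quoted in the text; the variable names are ours.

```python
GIGA = 10**9
PETA = 10**15

peak_per_core = 10 * GIGA   # flops, single-core peak quoted in the text
target_peak = 10 * PETA     # flops, peak of the hypothetical machine
effective = 1 * PETA        # flops, effective performance aimed at

# Number of cores needed to reach the peak figure.
cores = target_peak // peak_per_core
print(cores)

# Implied ratio of effective to peak performance.
efficiency = effective / target_peak
print(efficiency)

# Node counts when each node bundles 100 or 1,000 CPUs.
node_counts = [cores // cores_per_node for cores_per_node in (100, 1000)]
print(node_counts)
```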



TABLE II

Timing data obtained in a single node of IBM p5 1.9GHz.

Input                  1cew       1ao6_half  1ao6       1ao6_dim
No. Atoms              1701       9121       18242      36484
N_f                    106        561        1122       2244
N_d(N_f)               690        4192       8416       16832
N_es(N_f)              4875       152888     620465     2499814
I_m                    17         17         17         17
Time (sec)
  monomer              1356       13364      40005      140810
  (Average)            (0.752)    (1.40)     (2.10)     (3.69)
  SCF-dimer            2037       20689      70465      186901
  (Average)            (2.95)     (4.94)     (8.37)     (11.1)
  ES-dimer             398        13772      55955      208627
  (Average)            (0.0816)   (0.0901)   (0.0902)   (0.0835)
Elapsed Time (sec)     3799       47886      166601     536898


TABLE III

Timing data obtained in 16 CPUs of Xeon 3GHz.

Input                  1cew       1ao6_half  1ao6
No. Atoms              1701       9121       18242
N_f                    106        561        1122
N_d(N_f)               690        4192       8416
N_es(N_f)              4875       152888     620465
I_m                    17         17         17
Time (sec)
  monomer              1030.4     10808.9    33989.2
  (Average)            (9.72)     (19.3)     (30.3)
  SCF-dimer            1677.4     17517.6    50819.2
  (Average)            (41.3)     (71.0)     (102.7)
  ES-dimer             293.1      9594.8     39133.4
  (Average)            (1.02)     (1.07)     (1.07)
Elapsed Time (sec)     3003.5     38065.9    126330.9


B. Performance Prediction of FMO 

The procedure to predict the performance of FMO in the 
peta-scale computer is a somewhat phenomenological method 
based on actual measurements in current computer systems. 
First, we assume an execution model of FMO which gives a 
theoretical timing function of the number of fragments Nf. 
After fixing some parameters by several measurements of the 
program, we predict the elapsed time on the virtual computer 
with peta-scale specifications. 

1) Execution Model of GAMESS-FMO: Even if one could
analyze all the program code of GAMESS, it is almost
impossible to determine the number of floating-point
calculations in an FMO routine precisely, since there are many
conditional branches which depend on the molecular structure
given as an input. Our strategy here is a practical approach:
we obtain phenomenological values of the execution time through
experimental executions of FMO on current machines.

The total execution time of FMO is divided into three parts:
a monomer part, an SCF-dimer part, and an ES-dimer part. The
monomer and SCF-dimer parts perform SCF calculations for
each fragment and each pair of fragments, respectively, while
the ES-dimer part obtains dimer correction terms under an
electrostatic (ES) approximation between fragments. Usually,
the ES approximation is applied to fragment pairs whose
fragments are separated by more than a certain threshold.
We start from an expression of the total amount of computation
as a function of the number of fragments $N_f$,

$$F_{\mathrm{total}}(N_f) = F_m(N_f) + F_d(N_f) + F_{es}(N_f), \tag{1}$$

where $F_m(N_f)$, $F_d(N_f)$, and $F_{es}(N_f)$ represent the number
of floating-point calculations for the monomer, SCF-dimer, and
ES-dimer parts, respectively. These are assumed as

$$F_m(N_f) = \left(f_m^{(0)} + f_m^{(1)} N_f\right) N_f I_m, \tag{2}$$

$$F_d(N_f) = \left(f_d^{(0)} + f_d^{(1)} N_f\right) N_d(N_f), \tag{3}$$

$$F_{es}(N_f) = f_{es}^{(0)} N_{es}(N_f), \tag{4}$$

according to the algorithm of the FMO theory [4], where
$I_m$ is the number of monomer loops, and $N_d(N_f)$ and $N_{es}(N_f)$
represent the numbers of SCF-dimers and ES-dimers. SCF
calculations for monomers and dimers in FMO depend on the
environmental potential, which is made by all the fragments
other than the target monomer or dimer, while it can be
ignored for ES-dimers. Thus, we represent the amount of each
SCF calculation for a monomer and an SCF-dimer up to a
term linear in $N_f$, and that of the non-SCF calculation for an
ES-dimer as a constant. The parameters $f_m^{(0)}$, $f_m^{(1)}$, $f_d^{(0)}$,
$f_d^{(1)}$, and $f_{es}^{(0)}$ should be determined from the actual timing
data. If we define additional parameters, the number of parallel
nodes $K$ and the effective floating-point performance $E$ (flops)
of each node, we can compare the actual timing data with the
amount of computation divided by $KE$.

TABLE IV

Parameter values to represent the execution model of FMO

Parameter      Value
f_m^(0)        0.59
f_m^(1)        0.0014
f_d^(0)        2.83
f_d^(1)        0.0039
f_es^(0)       0.082
E_ibm          1.0
E_xeon         0.071

Tables II and III show the timing data of each part in test
executions of several inputs on two machines: one node of IBM
p5 1.9GHz and a 16-CPU cluster of Intel Xeon 3GHz. Since
the average fragment sizes in these inputs are considered
almost the same, the difference in the elapsed time mainly
depends on the total number of fragments $N_f$. A least-squares
fit, with additional parameters representing the
effective performance of each computer, $E_{\mathrm{ibm}}$ and $E_{\mathrm{xeon}}$,
determines all the parameter values shown in table IV after
the normalization $E_{\mathrm{ibm}} = 1$. In figure 5 (a), (b), and (c), we
plot the timing data and the results of the functions
(a) $(f_m^{(0)} + f_m^{(1)} N_f)/KE$, (b) $(f_d^{(0)} + f_d^{(1)} N_f)/KE$,
and (c) $f_{es}^{(0)}/KE$.
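
As an illustration of the fitting step, the sketch below applies an ordinary least-squares line to the monomer averages of table II only. This is a simplification and an assumption on our part: the fit reported in table IV also uses the Xeon data and the machine parameters, so exact agreement is not expected, but the slope and intercept obtained this way should land near $f_m^{(1)}$ and $f_m^{(0)}$. The helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Table II (IBM p5, E = 1): number of fragments and the average time
# per monomer calculation in seconds.
n_f = [106, 561, 1122, 2244]
monomer_avg = [0.752, 1.40, 2.10, 3.69]

intercept, slope = fit_line(n_f, monomer_avg)
print(f"f_m0 ~ {intercept:.2f}, f_m1 ~ {slope:.4f}")
```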

Fig. 5. The fragment number dependency of the amount of computation
for (a) a monomer, (b) an SCF-dimer, and (c) an ES-dimer are plotted with
the effective performance ratios (E = 1 for IBM p5, E = 0.071 for Xeon).

Fig. 6. ... $N_d(N_f)$ by the least-square method.

Fig. 7. Predicted elapsed time of the FMO calculation by GAMESS
as a function of the number of fragments $N_f$. This is obtained on
the assumption that 10,000 nodes are available, and each node is 5
times faster than a current machine.

Fig. 8. Multi-physics/multi-scale simulator stack including OpenFMO.

2) Performance in Peta-scale Computer: In order to predict
the total performance of FMO, it is convenient if the functions
$N_d(N_f)$ and $N_{es}(N_f)$ are represented as simple functions of
$N_f$. By a least-squares fit of the data in tables II and III,
we obtain a function for the number of SCF-dimers,

$$N_d(N_f) = 7.50\, N_f. \tag{5}$$

If we consider the constraint

$$N_d(N_f) + N_{es}(N_f) = \frac{(N_f - 1) N_f}{2}, \tag{6}$$

the number of ES-dimers is represented in the form

$$N_{es}(N_f) = \frac{(N_f - 1) N_f}{2} - 7.50\, N_f. \tag{7}$$
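
The pair-count constraint and the linear trend of the SCF-dimer count can be checked directly against the counts listed in table II. The short script below is only a consistency check; the dictionary layout and variable names are ours.

```python
# (N_f, N_d, N_es) for each input, copied from table II.
rows = {
    "1cew": (106, 690, 4875),
    "1ao6_half": (561, 4192, 152888),
    "1ao6": (1122, 8416, 620465),
    "1ao6_dim": (2244, 16832, 2499814),
}

ratios = {}
for name, (n_f, n_d, n_es) in rows.items():
    total_pairs = n_f * (n_f - 1) // 2
    # Every fragment pair is treated either as an SCF-dimer or as an ES-dimer.
    assert n_d + n_es == total_pairs, name
    ratios[name] = n_d / n_f
    print(f"{name:10s} pairs={total_pairs:8d}  N_d/N_f={ratios[name]:.2f}")
```

For the larger inputs the ratio N_d/N_f settles close to 7.5, while the smallest molecule falls somewhat below it.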

If we substitute these functions and parameters into Eqs.
(2), (3), and (4), the total computational amount of FMO
is obtained as a function of $N_f$. As already seen in the
previous section, the FMO algorithm is appropriate for parallel
execution when the number of fragments is large enough
compared to the number of available nodes. Suppose that we
have a peta-scale computer with $K = 10000$ and $E = 5$,
i.e., the number of available nodes is 10,000 and each node
has an effective speed 5 times faster than a node of IBM p5
1.9GHz, which is used in this paper as a reference machine
($E_{\mathrm{ibm}} = 1$). Then, the predicted elapsed time $F_{\mathrm{total}}(N_f)/KE$
is calculated as a quadratic function of $N_f$, shown in fig.
7. From the viewpoint of the elapsed time, we can perform
quantum calculations for molecules with more than 100,000
fragments if the peta-scale computer is realized.
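
The prediction can be reproduced approximately with a short sketch of the execution model. Several points are assumptions rather than statements of the text: the parameters of table IV are read as seconds on the reference node, the number of monomer loops is fixed at the value 17 observed in tables II and III, the SCF-dimer count is taken as 7.50 N_f, and the work is assumed to be spread perfectly over K nodes. The function name `predicted_seconds` is hypothetical.

```python
# Table IV parameters (seconds on the reference IBM p5 node, E = 1).
F_M0, F_M1 = 0.59, 0.0014
F_D0, F_D1 = 2.83, 0.0039
F_ES0 = 0.082
I_M = 17  # monomer loops; assumed equal to the value in tables II and III


def predicted_seconds(n_f, k, e, i_m=I_M):
    """Predicted elapsed time F_total(N_f) / (K * E) of the execution model."""
    n_d = 7.50 * n_f                       # number of SCF-dimers
    n_es = n_f * (n_f - 1) / 2 - n_d       # number of ES-dimers
    f_m = (F_M0 + F_M1 * n_f) * n_f * i_m  # monomer part
    f_d = (F_D0 + F_D1 * n_f) * n_d        # SCF-dimer part
    f_es = F_ES0 * n_es                    # ES-dimer part
    return (f_m + f_d + f_es) / (k * e)


# Sanity check against a measured case: 1ao6 on one IBM p5 node (table II).
single_node = predicted_seconds(1122, k=1, e=1)
print(f"1ao6, model: {single_node:.0f} s, measured: 166601 s")

# Hypothetical peta-scale machine of the text: K = 10000, E = 5.
peta = predicted_seconds(100_000, k=10_000, e=5)
print(f"N_f = 100000: {peta:.0f} s ({peta / 3600:.1f} h)")
```

Under these assumptions the model stays within roughly ten percent of the measured single-node time for 1ao6, and the peta-scale case comes out at a few hours, with the time growing about fourfold when N_f is doubled.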

According to effective performance measurements on
PCs, the FMO calculation is executed at 0.5 ~ 1.0 giga flops
per CPU of Xeon or Pentium4, which means that our
reference machine, a node of IBM p5 1.9GHz, exhibits almost
10 giga flops for the program since it is $E_{\mathrm{ibm}}/E_{\mathrm{xeon}} \simeq 14$
times faster than a Xeon. Then, the total performance of
FMO calculations on a peta-scale computer with $K = 10000$
and $E = 5$ is estimated as 0.5 peta flops. Thus, FMO is
considered a leading candidate for achieving peta-scale
performance.
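
The estimate above can be retraced as a short worked calculation. It uses only the numbers quoted in the text; the rounding of the per-node figure to 10 giga flops follows the paragraph above.

```latex
\begin{align*}
\frac{E_{\mathrm{ibm}}}{E_{\mathrm{xeon}}} &= \frac{1.0}{0.071} \simeq 14, \\
(0.5 \sim 1.0)\ \text{giga flops} \times 14 &\simeq 7 \sim 14\ \text{giga flops}
  \approx 10\ \text{giga flops per reference node}, \\
K \times E \times 10\ \text{giga flops} &= 10^{4} \times 5 \times 10^{10}\ \text{flops}
  = 5 \times 10^{14}\ \text{flops} = 0.5\ \text{peta flops}.
\end{align*}
```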

IV. OpenFMO Project 

Fig. 9. Can multi-physics/multi-scale simulations reveal the true aspect of
the complex world of matter?

In spite of the fact that FMO is a suitable algorithm
for achieving peta-scale performance through large-scale
parallel execution, the actual GAMESS-FMO code cannot be
executed for a molecule with more than 100,000 fragments.
One of the reasons why the current program fails to run is
memory consumption in each node: the current FMO
code in GAMESS tries to allocate two-dimensional arrays
indexed by pairs of fragments, e.g., the distance between two
fragments. Such an array exceeds 40 giga bytes even if we exploit
its symmetry. The limit on the number of fragments is considered
to be of the order of 10,000 in the current implementation. Thus,
a reconstruction of the FMO code is necessary in order to
support peta-scale computing.
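
The memory figure can be checked with a small calculation. The sketch assumes 8-byte (double precision) entries and storage of the upper triangle including the diagonal; neither detail is stated in the text, and the function name is hypothetical.

```python
def pair_array_bytes(n_fragments, bytes_per_entry=8, symmetric=True):
    """Size of one fragment-by-fragment array, e.g. inter-fragment distances."""
    if symmetric:
        # Upper triangle including the diagonal.
        entries = n_fragments * (n_fragments + 1) // 2
    else:
        entries = n_fragments ** 2
    return entries * bytes_per_entry


for n in (10_000, 100_000):
    size_gb = pair_array_bytes(n) / 1e9
    print(f"{n:7d} fragments: {size_gb:.2f} giga bytes per array")
```

The quadratic growth means that a tenfold increase in the number of fragments raises the size of each such array a hundredfold, which is why per-node pair arrays become the limiting factor.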

In this section, we introduce our new project to reconstruct
an FMO program from scratch. The name of this project,
OpenFMO [10], stands for the following kinds of openness: (1) the
names and argument lists of the APIs constituting the FMO program
are open to the public, (2) the program structure is also open
to combination with other theories of physics and chemistry, i.e.,
multi-scale/multi-physics simulations, and (3) the skeleton
program is developed under some open-source licenses and
its development process will also be open to the public. In
fig. 8, we show the stack structure of this program. Although
the main target is still a quantum chemical calculation of
molecules, it can be combined with other theories to construct
multi-physics/multi-scale simulations (fig. 9) for complex
phenomena [11].



A. Open Architecture Implementation of FMO 

As already shown in section II-A, the fragment MO method
consists of standard ab initio MO calculations including
the generation of Fock matrices with one- and two-electron integral
calculations. Using this property, the usual FMO program is
divided into two layers: the skeleton program, which controls
the whole flow of the FMO algorithm, and the molecular
orbital (MO) APIs, which provide the charge distribution of each
fragment through ab initio MO calculations. Since the
specifications of the interfaces between the skeleton and the APIs
are fixed and open to the public, either of them can easily be
replaced by other programs.

B. Open Interface to Multi-physics Simulations 

The multi-physics simulation is one of the leading
strategies for constructing peta-performance applications for
nano-scale materials. Our new implementation of FMO can also be
opened to such multi-physics simulations. Since the FMO
method is based on the electrostatic interaction between
fragments, we can generalize each fragment to an object
which can provide a static charge distribution. For example,
a molecular mechanics representation is often used for atomic
clusters that should in principle be given a quantum mechanical
description, and the environmental charge distribution surrounding
a molecule, which models a solvent, can be included as a
fragment. Thus, the MO-APIs for a certain fragment can be
replaced by other programs based on different approximation
levels.
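
The idea of a skeleton that talks to interchangeable charge providers can be sketched as follows. This is not the OpenFMO API: the interface `ChargeProvider`, the two placeholder classes, and the toy skeleton are all hypothetical, and the point charges and unit Coulomb prefactor are chosen only to keep the example runnable.

```python
import math
from typing import List, Protocol, Sequence, Tuple

Charge = Tuple[float, float, float, float]  # (x, y, z, q) in arbitrary units


class ChargeProvider(Protocol):
    """Hypothetical interface: anything that reports a static charge distribution."""

    def charges(self) -> List[Charge]:
        ...


class QuantumFragment:
    """Placeholder for a fragment whose charges would come from an MO calculation."""

    def __init__(self, points: List[Charge]) -> None:
        self._points = points

    def charges(self) -> List[Charge]:
        return list(self._points)


class PointChargeSolvent:
    """Placeholder for an environment modelled directly by fixed point charges."""

    def __init__(self, points: List[Charge]) -> None:
        self._points = points

    def charges(self) -> List[Charge]:
        return list(self._points)


def interaction_energy(a: ChargeProvider, b: ChargeProvider) -> float:
    """Coulomb sum q_i * q_j / r_ij between two providers (unit prefactor)."""
    energy = 0.0
    for xa, ya, za, qa in a.charges():
        for xb, yb, zb, qb in b.charges():
            energy += qa * qb / math.dist((xa, ya, za), (xb, yb, zb))
    return energy


def skeleton(fragments: Sequence[ChargeProvider]) -> float:
    """Toy skeleton: sums the electrostatic terms over all fragment pairs."""
    total = 0.0
    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            total += interaction_energy(fragments[i], fragments[j])
    return total


solute = QuantumFragment([(0.0, 0.0, 0.0, 1.0)])
solvent = PointChargeSolvent([(2.0, 0.0, 0.0, -1.0)])
pair_energy = skeleton([solute, solvent])
print(pair_energy)
```

The skeleton depends only on the `charges()` method, so a fragment described at a different approximation level can be substituted without touching the control flow.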

In addition to the description in molecular sciences, this
approach is extended to larger-scale simulations through the
"multi-physics/multi-scale simulator" layer. This type of extension
is widely used in large-scale computations representing
realistic models of molecules in cells or other living organisms.

C. Open Source Development of the Programs 

The source code of the skeleton program of OpenFMO is
publicly available under some open-source licenses. At
present, the OpenFMO project is managed by several core
members including the authors of this paper. Once we have shown
the effectiveness of our approach to the open implementation,
and the direction of the project is properly settled, we are
willing to change the management of the project to a so-called
open-source software development model.

V. Summary and Discussion 

In this paper, we showed our results in nanoscience obtained
on the NAREGI grid system, and expressed our perspective
on peta-scale computing. In section III, we showed
that FMO is one of the candidate peta-scale applications. However, it
has also become clear that the actual implementations of the
FMO method, at present, are not suited to execution
in peta-scale environments, i.e., they cannot handle molecules
with more than 10 thousand fragments even if peta-scale
computers are available. In order to improve the situation, we
started an open-source project for multi-physics simulations,
including new FMO codes written from scratch.



Generally speaking, it is very difficult for scientific
applications to achieve peta-scale performance, since a
simple parallel scheme fails when the total number of
CPUs is of the order of 1,000,000. Thus, it will be necessary
for the application itself to become multi-layered in order to
use the resources efficiently in accordance with the hardware
architecture of peta-scale computers. The FMO calculation
which we have studied in this paper has a two-stage structure
in its original algorithm: ab initio calculations are performed
for each fragment and fragment pair, while the interactions
between these fragments are described by classical
electrostatic potentials. This theoretical reconstruction of the original
ab initio MO method into a layered form is considered
the key to the successful performance of the program.

Scientific calculations with layered structures have been
carried out as multi-physics/multi-scale simulations in various
fields. This type of calculation is important not only as a
technique for realistic simulations but also as an actual example
of peta-scale computing. However, we claim that a
deep understanding of the physical/chemical theories is
necessary for the proper construction of effective simulations
which describe complex aspects of nature. Therefore, the
theoretical and practical way of constructing such simulations
should be established before peta-scale
computers become available. We expect that our implementation
of the simulator will be one of the solutions for high-performance
scientific simulations in the next generation.